Abstract

We present a novel distributed evolutionary algorithm, KaFFPaE, to solve the Graph Partitioning Problem,
which makes use of KaFFPa (Karlsruhe Fast Flow Partitioner). The use of our multilevel graph partitioner
KaFFPa provides new effective crossover and mutation operators. By combining these with a scalable
communication protocol we obtain a system that is able to improve the best known partitioning results for many
inputs in a very short amount of time. For example, in Walshaw's well known benchmark tables we are able to
improve or recompute 76% of entries for the tables with 1%, 3% and 5% imbalance.



1 Introduction 

Problems of graph partitioning arise in various areas of computer science, engineering, and related fields, for
example in high performance computing [27], community detection in social networks [25] and route planning [?].
In particular, the graph partitioning problem is very valuable for parallel computing. In this area, graph
partitioning is mostly used to partition the underlying graph model of computation and communication. Roughly
speaking, vertices in this graph represent computation units and edges denote communication. This graph needs to
be partitioned such that there are few edges between the blocks (pieces). In particular, if we want to use $k$
processors we want to partition the graph into $k$ blocks of about equal size.

In this paper we focus on a version of the problem that constrains the maximum block size to $(1 + \epsilon)$ times
the average block size and tries to minimize the total cut size, i.e., the number of edges that run between blocks.
It is well known that this problem is NP-complete [7] and that there is no approximation algorithm with a constant
ratio factor for general graphs [7]. Therefore, mostly heuristic algorithms are used in practice.

A successful heuristic for partitioning large graphs is the multilevel graph partitioning (MGP) approach
depicted in Figure 1, where the graph is recursively contracted to achieve smaller graphs which should reflect the
same basic structure as the input graph. After applying an initial partitioning algorithm to the smallest graph, the
contraction is undone and, at each level, a local refinement method is used to improve the partitioning induced by
the coarser level.

The main focus of this paper is a technique which integrates an evolutionary search algorithm with our
multilevel graph partitioner KaFFPa and its scalable parallelization. We present novel mutation and combine
operators which, in contrast to previous methods that use a graph partitioner [28, ?], do not need random
perturbations of edge weights. We show in Section 6 that the usage of edge weight perturbations decreases the
overall quality of the underlying graph partitioner. The new combine operators enable us to combine individuals of
different kinds (see Section 4 for more details). Due to the parallelization our system is able to compute
partitions that have quality comparable to or better than previous entries in Walshaw's well known partitioning
benchmark within a few minutes for graphs of moderate size. Previous methods of Soper et al. [28] required runtimes
of up to one week for graphs of that size. We therefore believe that, in contrast to previous methods, our method is
very valuable in the area of high performance computing.

The paper is organized as follows. We begin in Section 2 by introducing basic concepts. After briefly
presenting related work in Section 3, we continue by describing the main evolutionary components in Section 4 and
their parallelization in Section 5. A summary of extensive experiments done to tune the algorithm and evaluate its
performance is presented in Section 6. A brief outline of the techniques used in the multilevel graph partitioner
KaFFPa is provided in Appendix A. We have implemented these techniques in the graph partitioner KaFFPaE
(Karlsruhe Fast Flow Partitioner Evolutionary), which is written in C++. Experiments reported in Section 6 indicate
that KaFFPaE is able to compute partitions of very high quality and scales well to large networks and machines.




Figure 1: Multilevel graph partitioning.



2 Preliminaries 
2.1 Basic concepts 

Consider an undirected graph $G = (V, E, c, \omega)$ with edge weights $\omega : E \to \mathbb{R}_{>0}$, node weights
$c : V \to \mathbb{R}_{\geq 0}$, $n = |V|$, and $m = |E|$. We extend $c$ and $\omega$ to sets, i.e.,
$c(V') := \sum_{v \in V'} c(v)$ and $\omega(E') := \sum_{e \in E'} \omega(e)$.
$\Gamma(v) := \{u : \{v, u\} \in E\}$ denotes the neighbors of $v$. We are looking for blocks of nodes
$V_1, \ldots, V_k$ that partition $V$, i.e., $V_1 \cup \cdots \cup V_k = V$ and $V_i \cap V_j = \emptyset$ for
$i \neq j$. The balancing constraint demands that
$\forall i \in \{1, \ldots, k\} : c(V_i) \leq L_{\max} := (1 + \epsilon) c(V)/k + \max_{v \in V} c(v)$ for some
parameter $\epsilon$. The last term in this equation arises because each node is atomic and therefore a deviation of
the heaviest node has to be allowed. The objective is to minimize the total cut $\sum_{i<j} \omega(E_{ij})$ where
$E_{ij} := \{\{u, v\} \in E : u \in V_i, v \in V_j\}$. A clustering is also a partition of the nodes; however, $k$ is
usually not given in advance and the balance constraint is removed. A vertex $v \in V_i$ that has a neighbor
$w \in V_j$, $i \neq j$, is a boundary vertex. An abstract view of the partitioned graph is the so-called quotient
graph, where vertices represent blocks and edges are induced by connectivity between blocks. Given two clusterings
$\mathcal{C}_1$ and $\mathcal{C}_2$, the overlay clustering is the clustering where each block corresponds to a
connected component of the graph $G_\mathcal{E} = (V, E \setminus \mathcal{E})$ where $\mathcal{E}$ is the union of
the cut edges of $\mathcal{C}_1$ and $\mathcal{C}_2$, i.e., all edges that run between blocks in either
$\mathcal{C}_1$ or $\mathcal{C}_2$. By default, our initial inputs will have unit edge and node weights. However,
even those will be translated into weighted problems in the course of the algorithm.
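
The two quantities defined above, the total cut and the bound $L_{\max}$, can be illustrated with a minimal Python sketch. It is not taken from KaFFPa (which is written in C++); the function names `edge_cut` and `is_balanced`, the edge-list representation and the toy instance are assumptions made only for this illustration.

```python
from collections import defaultdict

def edge_cut(edges, block):
    """Total weight of edges whose endpoints lie in different blocks.

    edges: iterable of (u, v, weight); block: dict node -> block id.
    """
    return sum(w for u, v, w in edges if block[u] != block[v])

def is_balanced(node_weight, block, k, eps):
    """Check c(V_i) <= L_max = (1 + eps) * c(V) / k + max_v c(v) for every block."""
    total = sum(node_weight.values())
    l_max = (1 + eps) * total / k + max(node_weight.values())
    load = defaultdict(float)
    for v, b in block.items():
        load[b] += node_weight[v]
    return all(load[b] <= l_max for b in load)

# Hypothetical toy instance: a 4-cycle with unit weights, split into two halves.
edges = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1)]
node_weight = {v: 1 for v in range(4)}
block = {0: 0, 1: 0, 2: 1, 3: 1}
```

For the toy instance, the two edges {1, 2} and {3, 0} cross between the blocks, so the cut is 2, and both blocks stay below the bound for $k = 2$ and $\epsilon = 0.03$.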

A matching $M \subseteq E$ is a set of edges that do not share any common nodes, i.e., the graph $(V, M)$ has maximum
degree one. Contracting an edge $\{u, v\}$ means replacing the nodes $u$ and $v$ by a new node $x$ connected to the
former neighbors of $u$ and $v$. We set $c(x) = c(u) + c(v)$ so the weight of a node at each level is the number of
nodes it represents in the original graph. If replacing edges of the form $\{u, w\}, \{v, w\}$ would generate two
parallel edges $\{x, w\}$, we insert a single edge with $\omega(\{x, w\}) = \omega(\{u, w\}) + \omega(\{v, w\})$.
Uncontracting an edge $e$ undoes its contraction. In order to avoid tedious notation, $G$ will denote the current
state of the graph before and after a (un)contraction unless we explicitly want to refer to different states of the
graph. The multilevel approach to graph partitioning consists of three main phases. In the contraction (coarsening)
phase, we iteratively identify matchings $M \subseteq E$ and contract the edges in $M$. Contraction should quickly
reduce the size of the input and each computed level should reflect the global structure of the input network.
Contraction is stopped when the graph is small enough to be directly partitioned using some other, more expensive
algorithm. In the refinement (or uncoarsening) phase, the matchings are iteratively uncontracted. After
uncontracting a matching, a refinement algorithm moves nodes between blocks in order to improve the cut size or
balance.
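
A single contraction step can be sketched as follows. This is an illustrative Python fragment under assumed data structures (dictionaries keyed by node and by `frozenset` edge), not the implementation used in KaFFPa; the name `contract_matching` is hypothetical.

```python
def contract_matching(node_weight, edge_weight, matching):
    """Contract every edge of a matching.

    node_weight: dict node -> weight c(v)
    edge_weight: dict frozenset({u, v}) -> weight omega({u, v})
    matching: list of (u, v) pairs that share no endpoints
    Returns (coarse node weights, coarse edge weights, representative map).
    """
    rep = {v: v for v in node_weight}
    for u, v in matching:
        rep[v] = u  # the merged node x is represented by u
    coarse_nodes = {}
    for v, w in node_weight.items():
        # c(x) = c(u) + c(v)
        coarse_nodes[rep[v]] = coarse_nodes.get(rep[v], 0) + w
    coarse_edges = {}
    for e, w in edge_weight.items():
        a, b = (rep[x] for x in e)
        if a == b:
            continue  # the contracted edge itself disappears
        key = frozenset((a, b))
        # parallel edges are merged and their weights are added
        coarse_edges[key] = coarse_edges.get(key, 0) + w
    return coarse_nodes, coarse_edges, rep

# Hypothetical example: a unit-weight triangle where edge {0, 1} is contracted.
triangle_nodes = {0: 1, 1: 1, 2: 1}
triangle_edges = {frozenset((0, 1)): 1, frozenset((1, 2)): 1, frozenset((0, 2)): 1}
coarse_nodes, coarse_edges, rep = contract_matching(triangle_nodes, triangle_edges, [(0, 1)])
```

In the example, the merged node has weight 2 and the two former edges to node 2 collapse into one edge of weight 2, which is how unit-weight inputs turn into weighted problems on coarser levels.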

KaFFPa, which we use as a base case partitioner, extended the concept of iterated multilevel algorithms which
was introduced by [29]. The main idea is to iterate the coarsening and uncoarsening phase. Once the graph is
partitioned, edges that are between two blocks are not contracted. An F-cycle works as follows: on each level we
perform at most two recursive calls using different random seeds during contraction and local search. A second
recursive call is only made the second time that the algorithm reaches a particular level. As soon as the graph is
partitioned, edges that are between blocks are not contracted. This ensures nondecreasing quality of the partition
since our refinement algorithms guarantee no worsening and break ties randomly. These so-called global search
strategies are more effective than plain restarts of the algorithm. Extending this idea will yield the new combine
and mutation operators described in Section 4.

Local search algorithms find good solutions in a very short amount of time but often get stuck in local optima.
In contrast to local search algorithms, genetic/evolutionary algorithms are good at searching the problem space
globally. However, genetic algorithms lack the ability to fine-tune a solution, so local search algorithms can
help to improve the performance of a genetic algorithm. The combination of an evolutionary algorithm with a local
search algorithm is called a hybrid or memetic evolutionary algorithm [20].

3 Related Work 

There has been a huge amount of research on graph partitioning, so we refer the reader to [?, 31] for more
material on multilevel graph partitioning and to [20] for more material on genetic approaches for graph partitioning.
All general purpose methods that are able to obtain good partitions for large real world graphs are based on the
multilevel principle outlined in Section 2. Well known software packages based on this approach include Jostle [31],
Metis [19], and Scotch [24]. KaFFPa [17] is an MGP algorithm using local improvement algorithms that are based
on flows and more localized FM searches. It obtained the best results for many graphs in [28]. Since we use it as
a base case partitioner it is described in more detail in Appendix A. KaSPar [23] is a graph partitioner based on
the central idea to (un)contract only a single edge between two levels. KaPPa [?] is a "classical" matching based
MGP algorithm designed for scalable parallel execution.

Soper et al. [28] provided the first algorithm that combined an evolutionary search algorithm with a multilevel
graph partitioner. Here crossover and mutation operators have been used to compute edge biases, which yield
hints for the underlying multilevel graph partitioner. Benlic et al. [5] provided a multilevel memetic algorithm for
balanced graph partitioning. This approach is able to compute many entries in Walshaw's Benchmark Archive [28]
for the case $\epsilon = 0$. PROBE [8] is a meta-heuristic which can be viewed as a genetic algorithm without
selection. It outperforms other metaheuristics, but it is restricted to the case $k = 2$ and $\epsilon = 0$.

Very recently an algorithm called PUNCH [11] has been introduced. This approach is not based on the multilevel
principle. However, it creates a coarse version of the graph based on the notion of natural cuts. Natural cuts
are relatively sparse cuts close to denser areas. They are discovered by finding minimum cuts between carefully
chosen regions of the graph. The authors introduced an evolutionary algorithm which is similar to Soper et al. [28],
i.e., it uses a combine operator that computes edge biases yielding hints for the underlying graph partitioner.
Experiments indicate that the algorithm computes very good partitions for road networks. For instances that lack
the natural structure found in road networks, natural cuts are not very helpful.

4 Evolutionary Components 

The general idea behind evolutionary algorithms (EA) is to use mechanisms which are highly inspired by biological
evolution such as selection, mutation, recombination and survival of the fittest. An EA starts with a population of
individuals (in our case partitions of the graph) and evolves the population into different populations over several
rounds. In each round, the EA uses a selection rule based on the fitness of the individuals (in our case the edge
cut) of the population to select good individuals and combine them to obtain improved offspring [16]. Note that
we can use the cut as a fitness function since our partitioner almost always generates partitions that are within the
given balance constraint, i.e., there is no need to use a penalty function or something similar to ensure that the
final partitions generated by our algorithm are feasible. When an offspring is generated an eviction rule is used
to select a member of the population and replace it with the new offspring. In general one has to take into
consideration both the fitness of an individual and the distance between individuals in the population [2]. Our
algorithm generates only one offspring per generation. Such an evolutionary algorithm is called steady-state [9]. A
typical structure of an evolutionary algorithm is depicted in Algorithm 1.

For an evolutionary algorithm it is of major importance to keep the diversity in the population high [?], i.e.,
the individuals should not become too similar, in order to avoid a premature convergence of the algorithm. In
other words, to avoid getting stuck in local optima a procedure is needed that randomly perturbs the individuals.
In classical evolutionary algorithms, this is done using a mutation operator. It is also important to have operators
that introduce unexplored search space to the population. Through a new kind of crossover and mutation operators,
introduced in Section 4.1, we introduce more elaborate diversification strategies which allow us to explore the
search space more effectively.

Interestingly, Inayoshi et al. [?] noticed that good local solutions of the graph partitioning problem tend to be
close to one another. Boese et al. [6] showed that the quality of the local optima overall decreases as the distance
from the global optimum increases. We will see in the following that our combine operators can exchange good
parts of solutions quite effectively, especially if they have a small distance.



Algorithm 1 A classic general steady-state evolutionary algorithm.
procedure steady-state-EA
    create initial population P
    while stopping criterion not fulfilled
        select parents p1, p2 from P
        combine p1 with p2 to create offspring o
        mutate offspring o
        evict individual in population using o
    return the fittest individual that occurred



4.1 Combine Operators 

We now describe the general combine operator framework. This is followed by three instantiations of this
framework. In contrast to previous methods that use a multilevel framework, our combine operators do not need
perturbations of edge weights since we integrate the operators into our partitioner and do not use it as a complete
black box.

Furthermore, all of our combine operators assure that the offspring has a partition quality at least as good as the
best of both parents. Roughly speaking, the combine operator framework combines an individual/partition
$\mathcal{P} = V_1^{\mathcal{P}}, \ldots, V_k^{\mathcal{P}}$ (which has to fulfill a balance constraint) with a
clustering $\mathcal{C} = V_1^{\mathcal{C}}, \ldots, V_{k'}^{\mathcal{C}}$. Note that the clustering does not
necessarily have to fulfill a balance constraint and $k'$ is not necessarily given in advance. All instantiations of
this framework use a different kind of clustering or partition. The partition and the clustering are both used as
input for our multilevel graph partitioner KaFFPa in the following sense. Let $\mathcal{E}$ be the set of edges that
are cut edges, i.e., edges that run between two blocks, in either $\mathcal{P}$ or $\mathcal{C}$. All edges in
$\mathcal{E}$ are blocked during the coarsening phase, i.e., they are not contracted. In other words, these edges are
not eligible for the matching algorithm used during the coarsening phase and therefore are not part of any matching
computed. An illustration of this can be found in Figure 2.

The stopping criterion for the multilevel partitioner is modified such that it stops when no contractable edge is
left. Note that the coarsest graph is now exactly the same as the quotient graph $\mathcal{Q}'$ of the overlay
clustering of $\mathcal{P}$ and $\mathcal{C}$ of $G$ (see Figure 3). Hence vertices of the coarsest graph correspond
to the connected components of $G_\mathcal{E} = (V, E \setminus \mathcal{E})$ and the weight of the edges between
vertices corresponds to the sum of the edge weights running between those connected components in $G$.
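
The correspondence between the coarsest graph and the quotient graph of the overlay clustering can be made concrete with a small Python sketch. It uses a union-find structure instead of the matching-based contraction of the partitioner, so it only illustrates the end result; the function name `overlay_quotient` and the input format are assumptions for this example.

```python
def overlay_quotient(nodes, edges, part_a, part_b):
    """Overlay two clusterings and build the quotient graph of the overlay.

    edges: list of (u, v, weight); part_a, part_b: dict node -> block id.
    Edges cut by either clustering are blocked. The connected components of
    the remaining graph become the vertices of the coarsest graph.
    """
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    def is_cut(u, v):
        return part_a[u] != part_a[v] or part_b[u] != part_b[v]

    for u, v, _ in edges:
        if not is_cut(u, v):
            parent[find(u)] = find(v)  # contractable edge: merge components

    component = {v: find(v) for v in nodes}
    quotient = {}
    for u, v, w in edges:
        a, b = component[u], component[v]
        if a != b:
            key = frozenset((a, b))
            quotient[key] = quotient.get(key, 0) + w  # sum weights between components
    return component, quotient

# Hypothetical example: a path 0-1-2-3 with two different bipartitions.
path_edges = [(0, 1, 1), (1, 2, 1), (2, 3, 1)]
component, quotient = overlay_quotient(
    range(4), path_edges,
    {0: 0, 1: 0, 2: 1, 3: 1},
    {0: 0, 1: 1, 2: 1, 3: 1},
)
```

In the example only edge {2, 3} is uncut in both bipartitions, so the overlay has the three blocks {0}, {1} and {2, 3}, and the quotient graph is a path with two unit-weight edges.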

As soon as the coarsening phase is stopped, we apply the partition $\mathcal{P}$ to the coarsest graph and use this
as initial partitioning. This is possible since we did not contract any cut edge of $\mathcal{P}$. Note that due to
the specialized coarsening phase and this specialized initial partitioning we obtain a high quality initial solution
on a very coarse graph which is usually not discovered by conventional partitioning algorithms. Since our refinement
algorithms guarantee no worsening of the input partition and use random tie breaking we can assure nondecreasing
partition quality. Note that the refinement algorithms can effectively exchange good parts of the solution on the
coarse levels by moving only a few vertices. Figure 3 gives an example.

Figure 2: At the top, a graph $G$ with two partitions, the dark and the light line, is shown. Cut edges are not
eligible for the matching algorithm. Contraction is done until no matchable edge is left. The best of the two given
partitions is used as initial partition.

Figure 3: A graph $G$ and two bipartitions, the dotted and the dashed line (left). Curved lines represent a large
cut. The four vertices correspond to the coarsest graph in the multilevel procedure. Local search algorithms can
effectively exchange $v_2$ or $v_4$ to obtain the better partition depicted on the right-hand side (dashed line).

Also note that this combine operator can be extended to be a multi-point combine operator, i.e., the operator
would use $p$ instead of two parents. However, during the course of the algorithm a sequence of two-point combine
steps is executed which somehow "emulates" a multi-point combine step. Therefore, we restrict ourselves to the
case $p = 2$. When the offspring is generated we have to decide which solution should be evicted from the current
population. We evict the solution that is most similar to the offspring among those individuals in the population
that have a cut worse than or equal to that of the offspring itself. The difference of two individuals is defined as
the size of the symmetric difference between their sets of cut edges. This ensures some diversity in the population
and hence makes the evolutionary algorithm more effective.
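
The eviction rule can be sketched in a few lines of Python. The sketch assumes unit edge weights, so that the cut of an individual equals the number of its cut edges; the names `cut_edges` and `choose_eviction` and the tie-breaking by lowest index are assumptions of this illustration, not details given in the text.

```python
def cut_edges(edges, block):
    """Set of edges (as frozensets) whose endpoints lie in different blocks."""
    return {frozenset((u, v)) for u, v in edges if block[u] != block[v]}

def choose_eviction(edges, population, offspring):
    """Index of the individual to evict, or None if no individual qualifies.

    Candidates are individuals whose cut is worse than or equal to the
    offspring's cut (unit edge weights assumed). Among them, the one with the
    smallest symmetric difference of cut-edge sets is chosen.
    """
    off_cut = cut_edges(edges, offspring)
    candidates = []
    for i, individual in enumerate(population):
        ind_cut = cut_edges(edges, individual)
        if len(ind_cut) >= len(off_cut):
            candidates.append((len(ind_cut ^ off_cut), i))
    return min(candidates)[1] if candidates else None

# Hypothetical example on the path 0-1-2-3.
path = [(0, 1), (1, 2), (2, 3)]
population = [
    {0: 0, 1: 1, 2: 1, 3: 1},  # cuts {0,1}: difference 2 to the offspring
    {0: 0, 1: 0, 2: 1, 3: 0},  # cuts {1,2} and {2,3}: difference 1
]
offspring = {0: 0, 1: 0, 2: 1, 3: 1}  # cuts {1,2}
```

Here the second individual is evicted: both individuals are no better than the offspring, but the second one shares the cut edge {1, 2} with it and is therefore more similar.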

4.1.1 Classical Combine using Tournament Selection 

This instantiation of the combine framework corresponds to a classical evolutionary combine operator $C_1$. That
means it takes two individuals $P_1, P_2$ of the population and performs the combine step described above. In this
case $\mathcal{P}$ corresponds to the partition having the smaller cut and $\mathcal{C}$ corresponds to the
partition having the larger cut. Random tie breaking is used if both parents have the same cut. The selection
process is based on the tournament selection rule [22], i.e., $P_1$ is the fittest out of two random individuals
$R_1, R_2$ from the population. The same is done to select $P_2$. Note that in contrast to previous methods the
generated offspring will have a cut smaller than or equal to the cut of $\mathcal{P}$. Due to the fact that our
multilevel algorithms are randomized, a combine operation performed twice using the same parents can yield
different offspring.
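
Tournament selection with two candidates is simple enough to state as a short Python sketch. Whether $R_1$ and $R_2$ are drawn with or without replacement is not specified in the text; the sketch assumes two distinct individuals, and `tournament_select` and `cut_of` are hypothetical names.

```python
import random

def tournament_select(population, cut_of, rng=random):
    """Draw two distinct random individuals and return the one with the smaller cut."""
    r1, r2 = rng.sample(population, 2)
    return r1 if cut_of(r1) <= cut_of(r2) else r2

# Hypothetical usage: individuals are represented only by their cut values.
cuts = [120, 95, 130, 101]
p1 = tournament_select(cuts, lambda cut: cut, random.Random(0))
p2 = tournament_select(cuts, lambda cut: cut, random.Random(1))
```

With more than two individuals the worst one can never be selected, which biases the choice of parents towards fitter partitions without making it deterministic.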

4.1.2 Cross Combine (Transduction)

In this instantiation of the combine framework, $C_2$, the clustering $\mathcal{C}$ corresponds to a partition of
$G$. But instead of choosing an individual from the population we create a new individual in the following way. We
choose $k'$ uniformly at random in $[k/4, 4k]$ and $\epsilon'$ uniformly at random in $[\epsilon, 4\epsilon]$. We then
use KaFFPa to create a $k'$-partition of $G$ fulfilling the balance constraint
$\max_i c(V_i) \leq (1 + \epsilon') c(V)/k'$. In general, larger imbalances reduce the cut of a partition, which then
yields good clusterings for our crossover. To the best of our knowledge there has been no genetic algorithm that
performs combine operations combining individuals from different search spaces.

4.1.3 Natural Cuts 

Delling et al. [11] introduced the notion of natural cuts as a preprocessing technique for the partitioning of
road networks. The preprocessing technique is able to find relatively sparse cuts close to denser areas. We use
the computation of natural cuts to provide another combine operator, i.e., combining a $k$-partition with a
clustering generated by the computation of natural cuts. We closely follow their description: The computation of
natural cuts works in rounds. Each round picks a center vertex $v$ and grows a breadth-first search (BFS) tree $T$.
The BFS is stopped as soon as the weight of the tree, i.e., the sum of the vertex weights of the tree, reaches
$\alpha U$, for some parameters $\alpha$ and $U$. The set of the neighbors of $T$ in $V \setminus T$ is called the
ring of $v$. The core of $v$ is the union of all vertices added to $T$ before its size reached $\alpha U / f$ where
$f > 1$ is another parameter.

The core is then temporarily contracted to a single vertex s and the ring 
into a single vertex t to compute the minimum s-t-cut between them using the 
given edge weights as capacities. 

To ensure that every vertex eventually belongs to at least one core, and therefore is inside at least one cut, the
vertices $v$ are picked uniformly at random among all vertices that have not yet been part of any core in any round.
The process is stopped when there are no such vertices left.

In the original work [11] each connected component of the graph $G_C = (V, E \setminus C)$, where $C$ is the union
of all edges cut by the process above, is contracted to a single vertex. Since we do not use natural cuts as a
preprocessing technique at this place we do not contract these components. Instead we build a clustering
$\mathcal{C}$ of $G$ such that each connected component of $G_C$ is a block.

This technique yields the third instantiation of the combine framework, $C_3$, which is divided into two stages,
i.e., the clustering used for this combine step depends on the stage we are currently in. In both stages the
partition $\mathcal{P}$ used for the combine step is selected from the population using tournament selection. During
the first stage we choose $f$ uniformly at random in $[5, 20]$, $\alpha$ uniformly at random in $[0.75, 1.25]$ and
we set $U = |V|/3k$. Using these parameters we obtain a clustering $\mathcal{C}$ of the graph which is then used in
the combine framework described above. This kind of clustering is used until we reach an upper bound of ten calls to
this combine step. When the upper bound is reached we switch to the second stage. In this stage we use the
clusterings computed during the first stage, i.e., we extract elementary natural cuts and use them to quickly
compute new clusterings. An elementary natural cut (ENC) consists of a set of cut edges and the set of nodes in its
core. Moreover, for each node $v$ in the graph, we store the set of ENCs $N(v)$ that contain $v$ in their core. With
these data structures it is easy to pick a new clustering $\mathcal{C}$ (see Algorithm 2) which is then used in the
combine framework described above.

Algorithm 2 computeNaturalCutClustering (second stage)
    unmark all nodes in V
    for each v in V in random order do
        if v is not marked then
            pick a random ENC C in N(v)
            output C
            mark all nodes in C's core
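
Algorithm 2 translates almost line by line into Python. The following sketch assumes that ENCs are identified by arbitrary ids and that two dictionaries, here called `encs_of` (the sets $N(v)$) and `core_of`, were filled during the first stage; these names and the toy data are hypothetical.

```python
import random

def natural_cut_clustering(nodes, encs_of, core_of, rng=random):
    """Second-stage selection of elementary natural cuts (ENCs).

    nodes: list of node ids
    encs_of: dict node -> list of ENC ids whose core contains the node, i.e. N(v)
    core_of: dict ENC id -> set of nodes in the core of that ENC
    Returns the list of chosen ENC ids.
    """
    marked = set()
    chosen = []
    order = list(nodes)
    rng.shuffle(order)  # visit the nodes in random order
    for v in order:
        if v not in marked:
            enc = rng.choice(encs_of[v])  # random ENC that has v in its core
            chosen.append(enc)
            marked |= core_of[enc]  # mark all nodes in the core of the ENC
    return chosen

# Hypothetical toy data: three overlapping cores on four nodes.
core_of = {"A": {0, 1}, "B": {1, 2}, "C": {2, 3}}
encs_of = {0: ["A"], 1: ["A", "B"], 2: ["B", "C"], 3: ["C"]}
chosen = natural_cut_clustering([0, 1, 2, 3], encs_of, core_of, random.Random(42))
```

Because the first stage guarantees that every vertex belongs to at least one core, `encs_of[v]` is never empty, and the cores of the chosen ENCs cover all nodes. The cut edges of the chosen ENCs then define the clustering passed to the combine framework.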




Figure 4: On the top we see the computation of a natural cut. A BFS tree which starts from $v$ is grown. The gray
area is the core. The dashed line is the natural cut. It is the minimum cut between the contracted versions of the
core and the ring (shown as the solid line). During the computation several natural cuts are detected in the input
graph (bottom).



4.2 Mutation Operators 

We define two mutation operators, an ordinary and a modified F-cycle. Both mutation operators use a random
individual from the current population. The main idea is to iterate coarsening and refinement several times using
different seeds for random tie breaking. The first mutation operator $M_1$ can assure that the quality of the input
partition does not decrease. It is basically an ordinary F-cycle, which is an algorithm used in KaFFPa. Edges
between blocks are not contracted. The given partition is then used as initial partition of the coarsest graph. In
contrast to KaFFPa, we can now use the partition as input to the partitioner from the very beginning. This ensures
nondecreasing quality since our refinement algorithms guarantee no worsening. The second mutation operator $M_2$
works quite similarly, with the small difference that the input partition is not used as initial partition of the
coarsest graph. That means we obtain very good coarse graphs but we cannot assure that the final individual has a
higher quality than the input individual. In both cases the resulting offspring is inserted into the population
using the eviction strategy described in Section 4.1.

5 Putting Things Together and Parallelization



We now explain the parallelization and describe how everything is put together. Each processing element (PE)
basically performs the same operations using different random seeds (see Algorithm 3). First we estimate the
population size $S$: each PE performs a partitioning step and measures the time $t$ spent for partitioning. We then
choose $S$ such that the time for creating $S$ partitions is approximately $t_{\text{total}}/f$ where the fraction
$f$ is a tuning parameter and $t_{\text{total}}$ is the total running time that the algorithm is given to produce a
partition of the graph. Each PE then builds its own population, i.e., KaFFPa is called several times to create $S$
individuals/partitions. Afterwards the algorithm proceeds in rounds as long as time is left. With corresponding
probabilities, mutation or combine operations are performed and the new offspring is inserted into the population.
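
The population size estimate amounts to one division. The Python sketch below makes the arithmetic explicit; the rounding and the lower bound of two individuals (so that a combine step always has two parents) are assumptions of this illustration and are not stated in the text.

```python
def estimate_population_size(t_partition, t_total, f):
    """Choose S so that creating S partitions takes about t_total / f.

    t_partition: measured time for one partitioning step (seconds)
    t_total: total running time given to the algorithm (seconds)
    f: fraction tuning parameter
    """
    budget = t_total / f  # time reserved for building the initial population
    # Assumption: round to the nearest integer and keep at least two individuals.
    return max(2, round(budget / t_partition))

# Hypothetical numbers: ten minutes in total, f = 10, six seconds per partition.
population_size = estimate_population_size(6.0, 600.0, 10)
```

With these numbers one tenth of the time budget (60 seconds) is spent on the initial population, which yields ten individuals per PE.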

We choose a parallelization/communication protocol that is quite similar to randomized rumor spreading [12].
Let $p$ denote the number of PEs used. A communication step is organized in rounds. In each round, a PE chooses
a communication partner and sends it the currently best partition $P$ of the local population. The selection of the
communication partner is done uniformly at random among those PEs to which $P$ has not already been sent.
Afterwards, a PE checks if there are incoming individuals and if so inserts them into the local population using the
eviction strategy described above. If $P$ is improved, all PEs are again eligible. This is repeated $\log p$ times.
Note that the algorithm is implemented completely asynchronously, i.e., there is no need for a global
synchronisation. The process of creating individuals is parallelized as follows: Each PE makes $s' = |S|/p$ calls to
KaFFPa using different seeds to create $s'$ individuals. Afterwards we do the following $S - s'$ times: The root PE
computes a random cyclic permutation of all PEs and broadcasts it to all PEs. Each PE then sends a random individual
to its successor in the cyclic permutation and receives an individual from its predecessor in the cyclic
permutation. We call this particular part of the algorithm quick start.
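
One exchange round of the quick start can be simulated sequentially in Python. The real system exchanges individuals between PEs via message passing; the sketch below only models the data flow with one list per PE, and the name `quick_start_round` is hypothetical.

```python
import random

def quick_start_round(populations, rng=random):
    """Simulate one quick-start round on a list of per-PE populations.

    A random ordering of the PEs, read cyclically, defines a cyclic permutation.
    Every PE sends one random individual to its successor and therefore
    receives exactly one individual from its predecessor.
    """
    p = len(populations)
    order = list(range(p))
    rng.shuffle(order)  # order[i] sends to order[(i + 1) % p]
    # Pick all outgoing individuals first so that received ones are not forwarded.
    outgoing = {pe: rng.choice(populations[pe]) for pe in order}
    for i, pe in enumerate(order):
        successor = order[(i + 1) % p]
        populations[successor].append(outgoing[pe])

# Hypothetical example: four PEs, each holding one locally created individual.
populations = [["a"], ["b"], ["c"], ["d"]]
quick_start_round(populations, random.Random(1))
```

After the round every PE holds two individuals, its own and one created elsewhere, without any PE having called the partitioner a second time.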

The ratio $\frac{c}{10} : \frac{10-c}{10}$ of mutation to crossover operations yields a tuning parameter $c$. As we
will see in Section 6 the ratio $1 : 9$ is a good choice. After some experiments we fixed the ratio of the mutation
operators $M_1 : M_2$ to $4 : 1$ and the ratio of the combine operators $C_1 : C_2 : C_3$ to $3 : 1 : 1$.
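
The ratios above determine how an operator is drawn in each round. A minimal Python sketch, assuming that a mutation is chosen with probability $c/10$ and that the operator names are plain strings, could look as follows; `choose_operator` is a hypothetical name.

```python
import random

def choose_operator(c, rng=random):
    """Sample an operator name according to the ratios given in the text.

    Mutation is chosen with probability c / 10, otherwise a combine operator.
    M1 : M2 = 4 : 1 and C1 : C2 : C3 = 3 : 1 : 1.
    """
    if rng.random() < c / 10:
        return rng.choices(["M1", "M2"], weights=[4, 1])[0]
    return rng.choices(["C1", "C2", "C3"], weights=[3, 1, 1])[0]

# Hypothetical usage with the ratio 1 : 9, i.e. c = 1.
rng = random.Random(7)
sample = [choose_operator(1, rng) for _ in range(1000)]
```

With $c = 1$ about one in ten operations is a mutation, and the classical combine $C_1$ is the most frequent operator overall.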

Note that the communication step in the last line of the algorithm could also be performed only every $x$
iterations (where $x$ is a tuning parameter) to save communication time. Since the communication network of our
test system is very fast (see Section 6), we perform the communication step in each iteration.



Algorithm 3 All PEs perform basically the same operations using different random seeds.
procedure locallyEvolve
    estimate population size S
    while time left
        if elapsed time < t_total / f then create individual and insert into local population
        else
            flip coin c with corresponding probabilities
            if c shows head then
                perform a mutation operation
            else
                perform a combine operation
            insert offspring into population if possible
        communicate according to communication protocol



6 Experiments 

Implementation. We have implemented the algorithm described above using C++. Overall, our program (including
KaFFPa) consists of about 22 500 lines of code. We use two base case partitioners, KaFFPaStrong and
KaFFPaEco. KaFFPaEco is a good tradeoff between quality and speed, and KaFFPaStrong is focused on quality.
For the following comparisons we used Scotch 5.1.9 and kMetis 5.0 (pre2).

System. Experiments have been done on two machines. Machine A is a cluster with 200 nodes where each node
is equipped with two quad-core Intel Xeon processors (X5355) which run at a clock speed of 2.667 GHz. Each
node has 2x4 MB of level 2 cache and runs Suse Linux Enterprise 10 SP 1. All nodes are attached to an
InfiniBand 4X DDR interconnect which is characterized by its very low latency of below 2 microseconds and a
point to point bandwidth between two nodes of more than 1300 MB/s. Machine B has two Intel Xeon X5550 processors
and 48 GB RAM, and runs Ubuntu 10.04. Each CPU has 4 cores (8 cores when hyperthreading is active) running at
2.67 GHz. Experiments in Sections 6.1, 6.2, 6.3 and 6.5 have been conducted on machine A, and experiments in
Sections 6.4 and 6.6 have been conducted on machine B. All programs were compiled using GCC version 4.4.3
with optimization level 3 and OpenMPI 1.5.3. Henceforth, a PE is one core.



Instances. We report experiments on three suites of instances (small, medium sized and road networks)
summarized in Appendix C. rggX is a random geometric graph with $2^X$ nodes where nodes represent random points in
the unit square and edges connect nodes whose Euclidean distance is below $0.55\sqrt{\ln n / n}$. This threshold was
chosen in order to ensure that the graph is almost connected. DelaunayX is the Delaunay triangulation of $2^X$ random
points in the unit square. Graphs uk, 3elt .. fe_body and t60k .. memplus come from Walshaw's benchmark archive
[30]. Graphs deu and eur, bel and nld are undirected versions of the road networks used in [?]. luxemburg is a
road network taken from [3]. Our default numbers of partitions $k$ are 2, 4, 8, 16, 32, 64 since they are the default
values in [30], and in some cases we additionally use 128 and 256. Our default value for the allowed imbalance is
3% since this is one of the values used in [30] and the default value in Metis. Our default number of PEs is 16.



Methodology. We mostly present two kinds of data: average values and plots that show the evolution of solution
quality (convergence plots). In both cases we perform multiple repetitions. The number of repetitions depends
on the test that we perform. Average values over multiple instances are obtained as follows: for each instance
(graph, k) we compute the average edge cut over its repetitions, and we then take the geometric mean of these
averages over all instances.

We now explain how we compute the convergence plots. We start with a single instance I: whenever a
PE creates a partition it reports a pair (t, cut), where the timestamp t is the currently elapsed time on the particular
PE and cut refers to the cut of the partition that has been created. When performing multiple repetitions we report
average values (t, avgcut) instead. After the completion of KaFFPaE we are left with P sequences of pairs (t, cut)
which we now merge into one sequence. The merged sequence is sorted by the timestamp t. The resulting sequence
is called $T^I$. Since we are interested in the evolution of the solution quality, we compute another sequence
$T^I_{\min}$. For each entry (in sorted order) in $T^I$ we insert the entry $(t, \min_{t' \le t} cut(t'))$ into
$T^I_{\min}$. Here $\min_{t' \le t} cut(t')$ is the minimum cut that occurred until time t. $N^I_{\min}$ refers to the
normalized sequence, i.e. each entry (t, cut) in $T^I_{\min}$ is replaced by $(t_n, cut)$ where $t_n = t/t_I$ and $t_I$
is the average time that KaFFPa needs to compute a partition for the instance I.

To obtain average values over multiple instances we do the following: for each instance we label all
entries in $N^I_{\min}$, i.e. $(t_n, cut)$ is replaced by $(t_n, cut, I)$. We then merge all sequences $N^I_{\min}$ and
sort by $t_n$. The resulting sequence is called S. The final sequence $S_g$ presents event based geometric average
values. We start by computing the geometric mean cut value Q using the first value of all $N^I_{\min}$ (over I). To
obtain $S_g$ we sweep through S: for each entry $(t_n, c, I)$ in S (in sorted order) we update Q, i.e. the cut value of I
that took part in the computation of Q is replaced by the new value c, and insert $(t_n, Q)$ into $S_g$. Note that c can
only be smaller than or equal to the old cut value of I.
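The construction of the convergence sequences can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the evaluation scripts used for the experiments: the function names are hypothetical, cut values are assumed to be positive, and every instance is assumed to report at least one partition.

```python
import math

def running_min(events):
    """Sort (t, cut) reports by time and keep the best cut seen so far."""
    best, out = math.inf, []
    for t, cut in sorted(events):
        best = min(best, cut)
        out.append((t, best))
    return out

def normalize(seq, t_instance):
    """Divide every timestamp by the average time of one partitioner call."""
    return [(t / t_instance, cut) for t, cut in seq]

def geometric_sweep(normalized):
    """normalized maps an instance name to its normalized running-minimum sequence.

    Returns the event based sequence of (t_n, geometric mean of current cuts).
    """
    # Start from the first value of every instance.
    current = {inst: seq[0][1] for inst, seq in normalized.items()}
    merged = sorted((t, cut, inst)
                    for inst, seq in normalized.items()
                    for t, cut in seq)
    out = []
    for t, cut, inst in merged:
        current[inst] = cut  # replace the old cut value of this instance
        mean = math.exp(sum(math.log(c) for c in current.values()) / len(current))
        out.append((t, mean))
    return out
```

Because each per-instance sequence is a running minimum, the geometric mean produced by the sweep can only stay equal or decrease over normalized time.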



6.1 Parameter Tuning 

We now tune the fraction parameter f and the ratio between mutation and crossover operations. For the parameter
tuning we choose our small testset because runtimes for a single graph partitioner call are not too large. To save
runtime we focus on k = 64 for tuning the parameters. For each instance we gave KaFFPaE ten minutes of time and
16 PEs to compute a partition. During this test the quick start option is disabled.

For this test the flip coin parameter c is set to one. In Figure 5 we can see that the algorithm is not too sensitive
to the exact choice of this parameter. However, larger values of f speed up the convergence rate and improve
the result achieved in the end. Since f = 10 and f = 50 are the best parameters in the end, we choose f = 10 as
our default value. For tuning the ratio $\frac{c}{10} : \frac{10-c}{10}$ of mutation to crossover operations, we set
f to ten. We can see that for smaller values of c the algorithm is not too sensitive to the exact choice of the
parameter. However, if c exceeds 8 the convergence slows down, which yields worse average results in the end. We
choose c = 1 because it has a slight advantage in the end. The parameter tuning uses KaFFPaStrong as a partitioner.
We also performed the parameter tuning using KaFFPaEco as a partitioner (see Appendix B.1).

Figure 5: Conv. plots for the fraction f using c = 1 (left) and the flip coin c using f = 10 (right). The horizontal
axis shows the normalized time $t_n$.

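
The role of the flip coin parameter can be illustrated with a small sketch. We assume here, consistent with the ratio above, that a mutation is performed with probability c/10 and a crossover otherwise; the function name and the use of Python's random module are our own choices for illustration.

```python
import random

def choose_operation(c, rng):
    """Pick 'mutation' with probability c/10, otherwise 'crossover' (assumed reading of c)."""
    if not 0 <= c <= 10:
        raise ValueError("c is expected to lie in [0, 10]")
    return "mutation" if rng.random() < c / 10 else "crossover"
```

With c = 1 roughly one operation in ten is a mutation, and with c = 10 no crossover is performed at all.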
6.2 Scalability 

In this section we study the scalability of our algorithm. To obtain a fair comparison, each configuration gets
the same total amount of compute time, i.e. when doubling the number of PEs used, we divide the time
that KaFFPaE has to compute a partition per instance by two. To be more precise, when we use one PE KaFFPaE
has $t_1 = 15360$ s to compute a partition of an instance. When KaFFPaE uses p PEs, it gets time $t_p = t_1/p$ to
compute a partition of an instance. For all the following tests the quick start option is enabled. To save runtime we
use our small sized testset and fix k to 64. Here we perform five repetitions per instance. We can see in Figure 6
that using more processors speeds up convergence and, up to p = 128, also improves the quality in the end
(in these cases the speedups are optimal in the end). This might be due to island effects [1]. For p = 256 results are
worse compared to p = 1. This is because the algorithm is barely able to perform combine and mutation steps, due
to the very small amount of time given to KaFFPaE (60 seconds). On the largest graph of the testset (delaunay16)
we need about 20 seconds to create a partition into k = 64 blocks.
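
The time budgets follow directly from $t_p = t_1/p$. The short sketch below (the helper name is hypothetical) lists them for the powers of two up to 256 and makes the 60 second budget for p = 256 explicit.

```python
T1_SECONDS = 15360.0  # budget for one PE, as stated in the text

def time_budget(p, t1=T1_SECONDS):
    """Per-instance time budget in seconds when p PEs are used."""
    return t1 / p

budgets = {p: time_budget(p) for p in (1, 2, 4, 8, 16, 32, 64, 128, 256)}
```

At 60 seconds and about 20 seconds per partition on delaunay16, a PE can create only about three partitions on that graph.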

We now define the pseudo speedup $S_p(t_n)$, which is a measure for the speedup at a particular normalized time
$t_n$ of the configuration using one PE. Let $c_p(t_n)$ be the mean minimum cut that KaFFPaE has computed using p
PEs until normalized time $t_n$. The pseudo speedup is then defined as $S_p(t_n) = c'_1(t_n) / c'_p(t_n)$ where
$c'_i(t_n) = \min_{c_i(t') \le c_1(t_n)} t'$, i.e. the earliest normalized time at which the configuration using i PEs
reaches the cut $c_1(t_n)$. If $c_p(t) > c_1(t_n)$ for all t we set $S_p(t_n) = 0$ (in this case the parallel algorithm is
not able to compute the result computed by the sequential algorithm at normalized time $t_n$; this is only the case
for p = 256). We can see in Figure 6 that after a short amount of time we reach super linear pseudo speedups in
most cases.
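
A minimal sketch of the pseudo speedup computation is given below. It assumes that each convergence curve is a time-sorted list of (normalized time, mean minimum cut) pairs, and it takes the value 0 for the case in which the parallel run never reaches the sequential cut; all function names are hypothetical.

```python
def cut_at(curve, t_n):
    """Cut value reached until normalized time t_n (None if nothing was reported yet)."""
    value = None
    for t, cut in curve:
        if t > t_n:
            break
        value = cut
    return value

def first_time_reaching(curve, target):
    """Earliest normalized time at which the curve is at most target, or None."""
    for t, cut in curve:
        if cut <= target:
            return t
    return None

def pseudo_speedup(curve_1, curve_p, t_n):
    """Ratio of the times the one-PE and the p-PE runs need to reach c_1(t_n)."""
    target = cut_at(curve_1, t_n)
    if target is None:
        raise ValueError("the one-PE run has no result at this time")
    t_seq = first_time_reaching(curve_1, target)
    t_par = first_time_reaching(curve_p, target)
    return 0.0 if t_par is None else t_seq / t_par
```

A value above p then corresponds to a super linear pseudo speedup for the configuration using p PEs.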
Figure 6: Scalability of our algorithm: (left) a normal convergence plot, (middle) mean minimum cut relative to
best cut of KaFFPaE using one PE, (right) pseudo speedup $S_p(t_n)$ (larger versions can be found in Appendix B.3).

Figure 7: Convergence plots for the comparison of KaFFPaE with repeated executions of KaFFPa.



6.3 Comparison with KaFFPa and other Systems 

In this section we compare ourselves with repeated executions of KaFFPa and other systems. We switch to our
middle sized testset to avoid the effect of overtuning our algorithm parameters to the instances used for
calibration. We use 16 PEs and two hours of time per instance when we use KaFFPaE. We parallelized repeated
executions of KaFFPa (embarrassingly parallel, different seeds) and also gave 16 PEs and two hours to KaFFPa. We
look at k ∈ {2, 4, 8, 16, 32, 64, 128, 256} and performed three repetitions per instance. Figure 7 shows convergence
plots for k ∈ {32, 64, 128, 256}. All convergence plots can be found in Appendix B.2. As expected, the improvements
of KaFFPaE relative to repeated executions of KaFFPa increase with increasing k. The largest improvement is
obtained for k = 128. Here KaFFPaE produces partitions that have a 3.9% smaller cut value than plain restarts of the
algorithm. Note that using a weaker base case partitioner, e.g. KaFFPaEco, increases this value. On the small sized
testset we obtained an improvement of 5.9% for k = 64 compared to plain restarts of KaFFPaEco. Tables comparing
KaFFPaE with the best results out of ten repetitions of Scotch and Metis can be found in Appendix Table 4. Overall,
Scotch and Metis produce 19% and 28% larger (best) cuts than KaFFPaE respectively. However, these methods are
much faster than ours (Appendix Table 4).
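
The percentages quoted above can be recomputed from the average cut values listed in Appendix Table 4. The sketch below is only an arithmetic check; the helper name is hypothetical and the numbers are copied from the k = 128 and "overall" rows of that table.

```python
def larger_by_percent(cut, reference):
    """By how many percent cut exceeds reference."""
    return (cut / reference - 1.0) * 100.0

# k = 128: repeated executions of KaFFPa (11937) vs. KaFFPaE (11486)
restarts_vs_kaffpae = larger_by_percent(11937, 11486)
# overall: best cuts of Scotch (4507) and Metis (4835) vs. KaFFPaE (3779)
scotch_vs_kaffpae = larger_by_percent(4507, 3779)
metis_vs_kaffpae = larger_by_percent(4835, 3779)
```

Rounded, these evaluate to 3.9%, 19% and 28%, matching the values reported in the text.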

k         Reps. Avg.    KaFFPaE impr. %
2             569            0.2%
4            1229            1.0%
8            2206            1.5%
16           3568            2.7%
32           5481            3.4%
64           8141            3.3%
128         11937            3.9%
256         17262            3.7%
overall      3872            2.5%

Table 1: Different algorithms after two hours of time on 16 PEs.

6.4 Combine Operator Experiments


We now look into the effectiveness of our combine operator $C_1$. We conduct the following experiment: we
compare the best result of three repeated executions of KaFFPa (K3R) against a combine step (KC), i.e. after
creating two partitions we report the result of the combine step $C_1$ combining both individuals. The same is done
using the combine operator of Soper et al. [28] (SC), i.e. we create two individuals using perturbed edge weights as
in [28] and report the cut produced by the combine step proposed there (the best out of the three individuals). We
also present best results out of three repetitions when using perturbed edge weights as in Soper et al. (S3R). Since
our partitioner does not support double type edge weights, we computed the perturbations and scaled them by a
factor of 10000 (for S3R and SC). We performed ten repetitions on the middle sized testset. Results are reported in
Table 2. A table presenting absolute average values and comparing the runtime of these algorithms can be found in
Appendix Table 5. We can see that for large k our new combine operator yields improved partition quality in
comparable or less time (KC vs. K3R). Most importantly, we can see that edge biases decrease the solution quality
(K3R vs. S3R). This is due to the fact that edge biases make edge cuts optimal that are not close to optimal in the
unbiased problem. For example, on 2D grid graphs straight edge cuts are optimal. Random edge biases make bent
edge cuts optimal. However, these cuts are not close to optimal cuts of the original graph partitioning problem.
Moreover, local search algorithms (Flow-based, FM-based) work better if there are a lot of equally sized cuts.

Algo.    S3R       K3R       KC        SC
k        Avg.      improvement %
2          591     2.4       1.6       0.2
4         1304     3.4       4.0       0.2
8         2336     3.7       3.6       0.2
16        3723     2.9       2.0       0.2
32        5720     2.7       3.3       0.0
64        8463     2.8       3.0      -0.6
128      12435     3.6       4.5       0.0
256      17915     3.4       4.2      -0.1

Table 2: Comparison of quality of different algorithms relative to S3R.
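
The conversion of perturbed real-valued edge weights to integers can be sketched as follows. Only the scaling factor of 10000 is taken from the text; the multiplicative noise used in perturb is a hypothetical stand-in and not necessarily the perturbation scheme of Soper et al.

```python
import random

SCALE = 10000  # scaling factor stated in the text

def perturb(weights, strength, rng):
    """Hypothetical perturbation: multiply each weight by a random factor in [1, 1 + strength)."""
    return [w * (1.0 + strength * rng.random()) for w in weights]

def to_integer_weights(weights, scale=SCALE):
    """Scale real-valued weights and round them so an integer-only partitioner can use them."""
    return [round(w * scale) for w in weights]
```

For example, `to_integer_weights(perturb([1.0, 1.0, 2.0], 0.1, random.Random(1)))` yields integer weights that preserve the perturbation up to four decimal digits.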
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6.5 Walshaw Benchmark 



We now apply KaFFPaE to Walshaw's benchmark archive [28] using the rules used there, i.e., running time is
not an issue but we want to achieve minimal cut values for k ∈ {2, 4, 8, 16, 32, 64} and balance parameters ε ∈
{0, 0.01, 0.03, 0.05}. We focus on ε ∈ {1%, 3%, 5%} since KaFFPaE (more precisely KaFFPa) is not made for
the case ε = 0. We run KaFFPaE with a time limit of two hours using 16 PEs (two nodes of the cluster) per
graph, k and ε and report the best results obtained in Appendix D. KaFFPaE computed 300 partitions which
are better than the previous best partitions reported there: 91 for 1%, 103 for 3% and 106 for 5%. Moreover, it
reproduced equally sized cuts in 170 of the 312 remaining cases. When only considering the 15 largest graphs
and ε ∈ {1.03, 1.05} we are able to reproduce or improve the current result in 224 out of 240 cases. Overall our
systems (including KaPPa, KaSPar, KaFFPa, KaFFPaE) now improved or reproduced the entries in 550 out of 612
cases (for ε ∈ {0.01, 0.03, 0.05}).
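
The counts above can be cross-checked with a few lines of arithmetic. The sketch uses only numbers stated in this section; the variable names are our own.

```python
improved = {"1%": 91, "3%": 103, "5%": 106}   # strictly better partitions per imbalance
total_improved = sum(improved.values())
remaining = 312                                # cases without a strict improvement
reproduced = 170                               # equally sized cuts among the remaining cases
total_cases = total_improved + remaining
share = (total_improved + reproduced) / total_cases
```

This gives 300 improved entries, 612 cases in total and a share of improved or reproduced entries of about 76.8%, in line with the figure quoted in the abstract.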

6.6 Comparison with PUNCH 

In this section we focus on finding partitions for road networks. We implemented a specialized algorithm,
Buffoon, which is similar to PUNCH [11] in the sense that it also uses natural cuts as a preprocessing technique to
obtain a coarser graph on which the graph partitioning problem is solved. For more information on natural cuts, we
refer the reader to [11]. Using our (shared memory) parallelized version of natural cut preprocessing we obtain a
coarse version of the graph. Note that our preprocessing uses slightly different parameters than PUNCH (using the
notation of [11], we use C = 2, U = (1 + e)^!, f = 10, α = 1). Since partitions of the coarse graph correspond to
partitions of the original graph, we use KaFFPaE to partition the coarse version of the graph.

After preprocessing, we gave KaFFPaE $t_{eur,k} = k \times 3.75$ min on europe and $t_{ger,k} = k \times 0.9375$ min
on germany to compute a partition. In both cases we used all 16 cores (hyperthreading active) of machine B for
preprocessing and for KaFFPaE. The experiments were repeated ten times. A summary of the results is shown in
Table 3. Interestingly, on germany already our average values are smaller or equal to the best result out of 100
repetitions obtained by PUNCH. Overall in 9 out of 12 cases we compute a best cut that is better or equal to the best
cut obtained by PUNCH. Note that for obtaining the best cut values we invest significantly more time than PUNCH.
However, their machine is about a factor two faster (12 cores running at 3.33GHz compared to 8 cores running at
2.67GHz) and our algorithm is not tuned for road networks. A table comparing the results on road networks against
KaFFPa, KaSPar, Scotch and Metis can be found in the Appendix (Table 6). These algorithms produce 9%, 12%, 93%
and 288% larger cuts on average respectively.

7 Conclusion and Future Work 

KaFFPaE is a distributed evolutionary algorithm to tackle the graph partitioning problem. Due to new crossover
and mutation operators as well as its scalable parallelization it is able to compute the best known partitions for
many standard benchmark instances in only a few minutes. We therefore believe that KaFFPaE is helpful in
the area of high performance computing.

Regarding future work, we want to integrate other partitioners if they implement the possibility to block edges
during the coarsening phase and to use a given partitioning as initial solution. It would be interesting to try other
domain specific combine operators, e.g. on social networks it could be interesting to use a modularity clusterer to
compute a clustering for the combine operation.



grp, k    algorithm/runtime t
ger.      P_best    t_total    B_avg    t_avg    B_best
2            164         83      161        6       161
4            400         96      394        6       393
8            711        102      694        9       693
16          1144         83     1148       16      1137
32          1960         71     1928       31      1898
64          3165         83     3164       62      3143
eur.      P_best    t_total    B_avg    t_avg    B_best
2            129        423      149       39       129
4            309        358      313       39       310
8            634        293      693       47       659
16          1293        252     1261       73      1238
32          2289        217     2259      130      2240
64          3828        241     3856      248      3825

Table 3: Results on road networks: best results of PUNCH (P) out of 100 repetitions and total time [m] needed to
compute these results; average and best cut results of Buffoon (B) as well as average runtime [m] (including
preprocessing).
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A Karlsruhe Fast Flow Partitioner 



We now provide a brief overview of the techniques used in the underlying graph partitioner. KaFFPa [26] is a
classical matching based multilevel graph partitioner. Recall that a multilevel graph partitioner basically has
three phases: coarsening, initial partitioning and uncoarsening.

KaFFPa makes contraction more systematic by separating two issues: A rating function indicates how much
sense it makes to contract an edge based on local information. A matching algorithm tries to maximize the sum of
the ratings of the contracted edges looking at the global structure of the graph. While the rating function allows
a flexible characterization of what a "good" contracted graph is, the simple, standard definition of the matching
problem allows us to reuse previously developed algorithms for weighted matching. Matchings are contracted until
the graph is "small enough". In ifTTll we have observed that the rating function expansion*
works best among other edge rating functions, so that this rating function is also used in KaFFPa.

We employed the Global Path Algorithm (GPA) as a matching algorithm. It was proposed in [21] as a synthesis
of the Greedy algorithm and the Path Growing Algorithm lITSll . This algorithm achieves a half-approximation in the
worst case, but empirically, GPA gives considerably better results than Sorted Heavy Edge Matching and Greedy
(for more details see [17]). GPA scans the edges in order of decreasing weight but rather than immediately building
a matching, it first constructs a collection of paths and even cycles. Afterwards, optimal solutions are computed for
each of these paths and cycles using dynamic programming.
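
The dynamic programming step can be illustrated for the path case. The following is a minimal sketch, not the GPA implementation: it computes a maximum weight matching on a single path whose edge weights are given in path order, and it leaves out the even cycles that GPA also handles. The function name is hypothetical.

```python
def max_weight_matching_on_path(weights):
    """weights[i] is the weight of the edge between path nodes i and i + 1.

    Returns (total weight, sorted indices of the matched edges).
    """
    n = len(weights)
    best = [0] * (n + 1)  # best[i]: optimum using only the first i edges
    for i in range(1, n + 1):
        take = weights[i - 1] + (best[i - 2] if i >= 2 else 0)
        best[i] = max(best[i - 1], take)

    # Walk backwards to recover one optimal set of edges.
    chosen, i = [], n
    while i >= 1:
        take = weights[i - 1] + (best[i - 2] if i >= 2 else 0)
        if take > best[i - 1]:
            chosen.append(i - 1)
            i -= 2  # the neighbouring edge shares a node and is excluded
        else:
            i -= 1
    return best[n], chosen[::-1]
```

For the path with edge weights 3, 5, 3 the optimum matches the two outer edges (total 6), whereas a greedy choice of the heaviest edge first would only achieve 5.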

The contraction is stopped when the number of remaining nodes is below max(60k, n/(60k)). The graph is
then small enough to be partitioned by some initial partitioning algorithm. KaFFPa employs Scotch as an initial
partitioner since it empirically performs better than Metis.
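
As a small worked example of the stopping rule, the sketch below evaluates the threshold max(60k, n/(60k)); the function names are hypothetical.

```python
def contraction_threshold(n, k):
    """Node count below which coarsening stops, for n input nodes and k blocks."""
    return max(60 * k, n / (60 * k))

def should_stop(remaining_nodes, n, k):
    """True once the coarse graph is small enough for initial partitioning."""
    return remaining_nodes < contraction_threshold(n, k)
```

For a graph with $2^{16}$ nodes and k = 64 the first term dominates and contraction stops below 3840 nodes; for very large graphs and small k the second term takes over.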

Recall that the refinement phase iteratively uncontracts the matchings contracted during the contraction phase.
After a matching is uncontracted, local search based refinement algorithms move nodes between block boundaries
in order to reduce the cut while maintaining the balancing constraint. Local improvement algorithms are usually
variants of the FM-algorithm [14]. The algorithm is organized in rounds. In each round, a priority queue P is used
which is initialized with all vertices that are incident to more than one block, in a random order. The priority is
based on the gain $g(v) = \max_P g_P(v)$ where $g_P(v)$ is the decrease in edge cut when moving v to block P. Ties
are broken randomly if there is more than one block that yields the maximum gain when moving v to it. Local
search then repeatedly looks for the highest gain node v. Each node is moved at most once within a round. After
a node is moved its unmoved neighbors become eligible, i.e. its unmoved neighbors are inserted into the priority
queue. When a stopping criterion is reached, all movements made after the best cut found within the balance
constraint are undone. This process is repeated several times until no improvement is found.
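
The gain used as priority can be sketched as follows. This is an illustration under our own assumptions (adjacency lists with edge weights, a dictionary mapping nodes to block ids, hypothetical function names); the priority queue, the balance constraint and the random tie breaking are left out.

```python
from collections import defaultdict

def gains(v, block, adjacency):
    """Map each neighbouring block to the decrease in edge cut when v moves there.

    adjacency[v] is a list of (neighbour, edge weight) pairs.
    """
    weight_to = defaultdict(int)
    for u, w in adjacency[v]:
        weight_to[block[u]] += w
    internal = weight_to[block[v]]  # edges that would become cut edges
    return {b: w - internal for b, w in weight_to.items() if b != block[v]}

def best_gain(v, block, adjacency):
    """g(v): the maximum gain over all neighbouring blocks, or None for interior nodes."""
    candidates = gains(v, block, adjacency)
    return max(candidates.values()) if candidates else None
```

Moving v to a block removes the edges into that block from the cut and adds the edges into its old block, so the gain is the difference of these two weights; edges into third blocks stay cut either way.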

During the uncoarsening phase KaFFPa additionally uses more advanced refinement algorithms. The first
method is based on max-flow min-cut computations between pairs of blocks, i.e., a method to improve a given
bipartition. Roughly speaking, this improvement method is applied between all pairs of blocks that share a
non-empty boundary. The algorithm basically constructs a flow problem by growing an area around the given
boundary vertices of a pair of blocks such that each min cut in this area yields a feasible bipartition of the original
graph within the balance constraint. This yields a locally improved k-partition of the graph. The second method for
improving a given partition is called multi-try FM. Roughly speaking, a k-way local search initialized with a single
boundary node is repeatedly started. Previous methods are initialized with all boundary nodes.

KaFFPa extended the concept of iterated multilevel algorithms which was introduced by [29]. The main idea
is to iterate the coarsening and uncoarsening phase. Once the graph is partitioned, edges that are between two
blocks are not contracted. An F-cycle works as follows: on each level we perform at most two recursive calls using
different random seeds during contraction and local search. A second recursive call is only made the second time
that the algorithm reaches a particular level. Since edges between blocks are never contracted, the quality of the
partition cannot decrease: our refinement algorithms guarantee no worsening and break ties randomly. These
so-called global search strategies are more effective than plain restarts of the algorithm.
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B Additional Experimental Data 



B.1 Further Parameter Tuning

In this section we perform parameter tuning using KaFFPaEco (a faster but less powerful configuration than
KaFFPaStrong) as a base case partitioner. We start by tuning the fraction parameter f. As before we set the flip coin
parameter c to one. In Figure 8 we can see that the algorithm is not too sensitive to the exact choice of this
parameter. As before, larger values of f speed up the convergence rate and improve the result achieved in the end.
Since f = 50 is the best parameter in the end, we choose it as our default value.

We now tune the ratio $\frac{c}{10} : \frac{10-c}{10}$ of mutation to crossover operations. For this test we set
f = 50. The results are similar to the results achieved when using KaFFPaStrong as a base case partitioner. Again we
can see that for smaller values of c the algorithm is not too sensitive to the exact choice of the parameter. When
c = 10, i.e. no crossover operation is performed, the convergence slows down, which yields worse average results in
the end. The results of c = 9 and c = 1 are comparable in the end. We choose c = 1 for consistency.

Figure 8: Conv. plots for the fraction f using c = 1 (left) and the flip coin c using f = 50 (right). In both cases
KaFFPaEco is used as a base case partitioner. The horizontal axes show the normalized time $t_n$.



B.2 Further Comparison Data 



k/Algo.    Reps.     KaFFPaE    Scotch               Metis
           Avg.      Avg.       Best     t_avg[s]    Best     t_avg[s]
2            569       568        671     0.22         711     0.12
4           1229      1217       1486     0.41        1574     0.13
8           2207      2173       2663     0.62        2831     0.13
16          3568      3474       4192     0.86        4500     0.14
32          5481      5298       6437     1.15        6899     0.15
64          8141      7879       9335     1.46       10306     0.18
128        11937     11486      13427     1.85       14500     0.20
256        17262     16634      18972     2.28       20341     0.25
overall     3872      3779       4507     0.87        4835     0.16

Table 4: Averages of final values of different algorithms on the middle sized testset. KaFFPa (Reps) and KaFFPaE
were given two hours of time on 16 PEs per repetition and instance. Average values of Metis and Scotch are
average values of the best cut that occurred out of ten repetitions.
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Figure 9: Convergence plots for the comparison with repeated executions of KaFFPa. 
k       S3R              K3R              KC               SC
        avg.      t      avg.      t      avg.      t      avg.      t
2         591     19       577     14       582     12       590     17
4        1304     30      1261     28      1254     22      1302     27
8        2336     40      2252     45      2255     36      2332     41
16       3723     54      3617     67      3649     57      3714     61
32       5720     82      5569    110      5540     99      5722     84
64       8463    116      8236    164      8213    146      8512    113
128     12435    171     12008    239     11895    225     12432    162
256     17915    217     17335    327     17199    329     17935    232

Table 5: Comparison of different combine operators. Average values of cuts and runtime.
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B.3 Larger Scalability Plots 
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Figure 10: Scalability of our algorithm: (upper) a normal convergence plot, (middle) mean minimum cut relative 
to best cut of KaFFPaE using one PE, (lower) pseudo speedup $S_p(t_n)$.



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Table 6: Detailed per instance results for road networks. PUNCH was run 100 times, Buffoon 10 times and KaFFPa, 
KaSPar, Scotch and Metis were run 5 times.
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Table 7: Basic properties of our benchmark set. 



D Detailed Walshaw Benchmark Results 
Table 8: Computing partitions from scratch, ε = 1%. In each k-column the results computed by KaFFPaE are on the
left and the current Walshaw cuts are presented on the right side.

Table 9: Computing partitions from scratch, ε = 3%. In each k-column the results computed by KaFFPaE are on the
left and the current Walshaw cuts are presented on the right side.
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Table 10: Computing partitions from 
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